[RL] pause: use abort pipeline with scheduling loop alive for req drained #7753
jackyYang6 wants to merge 1 commit into develop
Conversation
Thanks for your contribution! |
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: 1 required task failed
2. Task status summary
   2.1 Required tasks: 7/10 passed
   2.2 Optional tasks: 28/30 passed
3. Failure details (required only): Approval — code style (confidence: high): Approval
   Root-cause details: Key logs: Suggested fix:
Fix-suggestion summary: please have xyxinyang or zyyzghb review and approve. Related change: the PR adds … to the pause logic
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7753   +/-  ##
==========================================
  Coverage         ?   63.17%
==========================================
  Files            ?      461
  Lines            ?    64121
  Branches         ?     9821
==========================================
  Hits             ?    40506
  Misses           ?    20840
  Partials         ?     2775
```
Flags with carried forward coverage won't be shown.
Force-pushed 8fec3b3 to 7d634d5
Force-pushed 7d634d5 to 8ea07fe
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 15:51:59
📋 Review Summary
PR overview: rewrites _control_pause() as a two-phase pause: first gracefully abort all requests through the abort pipeline (returning partial results), then pause the scheduling loop, resolving the abort+pause deadlock and the loss of partial results in RL scenarios.
Scope of change: engine/common_engine.py, engine/sched/resource_manager_v1.py, entrypoints/, router/, and related tests
Impact tags: [Engine] [APIServer] [Scheduler]
📝 PR Convention Check
Description structure is compliant (all 5 required sections are present and filled in).
The title tag [RL] is a valid official tag, but per the impact-mapping table in architecture.md, [RL] maps to fastdeploy/rl/, while this PR's changes are concentrated in fastdeploy/engine/ and fastdeploy/entrypoints/. [Engine] is recommended (the author listed both [RL] and [Engine] in the PR checklist, and [Engine] describes the change scope more accurately).
Suggested title (copy-paste ready):
[Engine] pause: use abort pipeline with scheduling loop alive for graceful pause in RL scenarios
Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/common_engine.py:1326 | When req_id is None, the handler warns but does not `continue`, so None is written into waiting_abort_req_id_set |
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1504 | _wait_inflight_drained() has no timeout protection; if a worker fails, the control thread blocks forever with no way to recover |
| ❓ Question | fastdeploy/engine/common_engine.py | The execution flow in the PR description lists handle scheduler stragglers, _wait_output_queue_empty(), and scheduler.reset(), none of which appear in the code; the description and the implementation diverge |

Overall assessment
The two-phase pause design is clear and addresses the deadlock and partial-result loss at the root, and the reused abort-pipeline path has been functionally verified. One P0 bug remains (the missing `continue` on a null req_id lets an invalid abort enter the pipeline), and the lack of a timeout is a latent operational risk worth attention.
```python
"Receive abort request without request_id, skip invalid abort message"
)
self.llm_logger.info(f"Receive abort request, req_id: {req_id}")
self.resource_manager.add_abort_req_ids(req_id)
```
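A minimal sketch of the fix the 🔴 item asks for: skip the message with `continue` instead of letting `None` reach `add_abort_req_ids`. The handler and stub classes below are hypothetical stand-ins for the engine's real objects, shaped after the snippet above:

```python
class _Logger:
    """Stub for the engine's llm_logger."""
    def __init__(self):
        self.lines = []
    def warning(self, msg):
        self.lines.append(("warn", msg))
    def info(self, msg):
        self.lines.append(("info", msg))

class _ResourceManager:
    """Stub exposing the abort set named in the review."""
    def __init__(self):
        self.waiting_abort_req_id_set = set()
    def add_abort_req_ids(self, req_id):
        self.waiting_abort_req_id_set.add(req_id)

class AbortHandler:
    """Hypothetical stand-in for the engine's abort-message handler."""
    def __init__(self):
        self.llm_logger = _Logger()
        self.resource_manager = _ResourceManager()

    def handle_abort_messages(self, messages):
        for msg in messages:
            req_id = msg.get("request_id")
            if req_id is None:
                # The reviewed bug: warning without `continue` let None
                # fall through into waiting_abort_req_id_set.
                self.llm_logger.warning(
                    "Receive abort request without request_id, skip invalid abort message"
                )
                continue
            self.llm_logger.info(f"Receive abort request, req_id: {req_id}")
            self.resource_manager.add_abort_req_ids(req_id)
```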
Force-pushed 8ea07fe to ba85acb
…l termination

Replace the old preempted_all + error_response approach in _control_pause with a two-phase design:

Phase 1: Block new requests via _rejecting_new_requests (NOT is_paused)
- Scheduling loop keeps running so _trigger_abort can process
- add_abort_req_ids(ALL) marks all requests for abort
- Scheduling loop catches them via _trigger_abort as they cycle through

Phase 2: After drain, set is_paused=True to fully stop scheduling loop
- Handle scheduler-only stragglers with direct _send_error_response
- Wait for output queue empty, then reset

Depends-on: PaddlePaddle#7615 (refact abort_requests to fire-and-forget)
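The two phases above can be sketched as a toy state model (a simplified illustration under assumed names, not the actual FastDeploy implementation; the real abort pipeline is asynchronous and returns partial results, which this sketch only stubs out):

```python
import threading

class TwoPhasePauseModel:
    """Toy model of the two-phase pause described in the commit message.

    Phase 1 sets `rejecting_new_requests` so new work is refused while the
    scheduling loop stays alive to drive the abort pipeline; phase 2 sets
    `is_paused` only after in-flight requests have drained.
    """
    def __init__(self):
        self.rejecting_new_requests = False
        self.is_paused = False
        self.inflight = set()
        self._lock = threading.Lock()

    def add_request(self, req_id):
        # Gate mirrors the described check on both flags.
        if self.is_paused or self.rejecting_new_requests:
            return False  # rejected, but the scheduling loop keeps running
        with self._lock:
            self.inflight.add(req_id)
        return True

    def abort_all(self):
        # Stand-in for add_abort_req_ids(ALL); the real pipeline returns
        # partial results per request, here we just drain the set.
        with self._lock:
            self.inflight.clear()

    def control_pause(self):
        self.rejecting_new_requests = True   # phase 1: reject, don't stop
        self.abort_all()                     # abort pipeline runs to completion
        assert not self.inflight             # drained
        self.is_paused = True                # phase 2: stop scheduling loop
```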
Force-pushed ba85acb to 3ef80aa
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 16:20:55
📋 Review Summary
PR overview: rewrites the _control_pause() implementation to fix the abort-pipeline deadlock and the dropping of partial results in RL scenarios.
Scope of change: fastdeploy/engine/common_engine.py, tests/engine/test_common_engine.py
Impact tags: [Engine] [RL]
📝 PR Convention Check
The PR description structure is complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist all present); pre-commit is unchecked in the checklist, please run it before merging.
Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1491 | _wait_inflight_drained() has no timeout; it blocks forever if the abort pipeline stalls |
| ❓ Question | fastdeploy/engine/common_engine.py:1463 | The PR execution flow describes scheduler.reset(), but it is missing from the code; scheduler.responses may hold residual entries |
| ❓ Question | fastdeploy/engine/common_engine.py | The PR Modifications table declares a new method _wait_output_queue_empty(), but the diff contains no such implementation |
| ❓ Question | fastdeploy/engine/common_engine.py | token_processor.clear_data() was removed without a comment explaining whether the abort pipeline now covers its cleanup duties |

Overall assessment
The overall design is clear: the two-phase separation of reject and pause effectively resolves the deadlock, and passing accuracy tests verifies functional correctness. However, the PR description and the implementation diverge in several places (scheduler.reset() and _wait_output_queue_empty() are described but absent from the code); the author should either explain the omissions or complete the implementation. A fallback safeguard is also recommended for the no-timeout design of _wait_inflight_drained().
```python
No timeout — abort pipeline will complete. Aligned with SGLang's poll-until-drained.
"""
start_time = time.time()
while (
```
🟡 Suggestion: _wait_inflight_drained() has no timeout mechanism and may block forever
The old code waited on the worker queue with a 60s timeout and raised an Exception; the new design drops timeout protection entirely. The docstring says "No timeout — abort pipeline will complete", but if the abort pipeline stalls due to a bug or an anomaly (e.g. a worker hang or a lost ZMQ message), _control_pause() blocks forever, the upstream RL framework caller cannot tell, and the engine hangs silently.
Suggested fallback timeout:
```python
DRAIN_TIMEOUT = 120
start_time = time.time()
while (self.resource_manager.requests or self.scheduler.requests
        or self.resource_manager.waiting_abort_req_id_set
        or self.resource_manager.to_be_aborted_req_id_set):
    if time.time() - start_time > DRAIN_TIMEOUT:
        self.llm_logger.error(f"Drain timed out after {DRAIN_TIMEOUT}s, abort pipeline may have stalled!")
        raise TimeoutError(f"_wait_inflight_drained timed out after {DRAIN_TIMEOUT}s")
    time.sleep(0.005)
```
```python
self._send_error_response(req.request_id, "Request is aborted since engine is paused.")
self.scheduler.reset()

if envs.ENABLE_V1_KVCACHE_MANAGER:
```
❓ Question: the PR execution flow describes scheduler.reset(), but the code here lacks it
The end of the execution flow in the PR description explicitly lists scheduler.reset() + cache reset, but only the cache reset is present; self.scheduler.reset() is never called.
Looking at the local_scheduler.reset() implementation (line 115-119), it clears:
- ids_read_cursor (reset to 0)
- ids (the list of all historical request IDs)
- requests (the dict of pending requests)
- responses (the dict of received responses)
_wait_inflight_drained() only checks that requests and the abort queues are empty; it does not check responses. If scheduler.responses holds residual unconsumed data, resume may produce inconsistent state.
Please confirm: is omitting scheduler.reset() intentional (the abort pipeline's normal path already guarantees responses is fully consumed), or is it a missing piece of the implementation?
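To make the question concrete, here is a hedged sketch of a drain check extended to also cover `scheduler.responses` (attribute names follow the review text; whether `responses` must gate the drain is exactly the open question, so the extension is kept behind a flag):

```python
import time

def wait_inflight_drained(engine, poll_interval=0.005, also_check_responses=True):
    """Poll until all in-flight engine state is empty.

    `also_check_responses` is a hypothetical extension: the PR's version
    checks only `requests` and the two abort sets, so residual entries in
    `scheduler.responses` could survive into resume.
    """
    while True:
        pending = (
            engine.resource_manager.requests
            or engine.scheduler.requests
            or engine.resource_manager.waiting_abort_req_id_set
            or engine.resource_manager.to_be_aborted_req_id_set
        )
        if also_check_responses:
            # Extra condition under discussion in the review.
            pending = pending or engine.scheduler.responses
        if not pending:
            return
        time.sleep(poll_interval)
```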
Depends-on: #7615 (refact abort_requests to fire-and-forget)
Motivation
In RL scenarios, the upstream framework calls `abort_request` followed by `pause` to stop the engine. The old `_control_pause` implementation had two critical issues:
1. Lost partial results: `preempted_all()` + `_send_error_response(500)` discarded already-inferred tokens, returning an error instead of partial results to clients.
2. Deadlock with the abort pipeline: setting `is_paused=True` at the start blocked the scheduling loop (`_pause_cond.wait_for`), which prevented `_trigger_abort` from processing abort requests, causing a 30s timeout deadlock.

The new design separates "reject new requests" (`_rejecting_new_requests`) from "pause scheduling loop" (`is_paused`), allowing the abort pipeline to complete naturally before the engine state reset. This ensures partial inference results are returned to clients via `token_processor._put_abort_results` (200 "Aborted") through the normal output path.
Modifications
fastdeploy/engine/common_engine.py:
- Add `self._rejecting_new_requests = False` in `__init__` to decouple request rejection from scheduling-loop pause
- Gate incoming requests with `if self.is_paused or self._rejecting_new_requests:`
- Rewrite `_control_pause()`
- Add `_wait_inflight_drained()`, which polls until `resource_manager.requests` is empty
Execution flow
Usage or Command
Accuracy Tests
Checklist
- Tags: [RL], [Engine]
- Run `pre-commit` before commit.
- Tests added (`test_control_pause_and_resume_paths`)
- If targeting the `release` branch, make sure the PR has been submitted to the `develop` branch.